Comparing a statistical and a rule-based tagger for German

Authors

Martin Volk

Gerold Schneider

Abstract

In this paper we present the results of comparing a statistical tagger for German based on decision trees with a rule-based Brill-Tagger for German. We used the same training corpus (and therefore the same tag-set) to train both taggers. We then applied the taggers to the same test corpus and compared their behavior, in particular their error rates. Both taggers perform similarly, with an error rate of around 5%. The detailed error analysis shows that the rule-based tagger has more problems with unknown words than the statistical tagger, while the reverse holds for tokens that are many-ways ambiguous. When the number of unknown words is reduced with the help of an external lexicon (such as the Gertwol system), the error rate of the rule-based tagger drops to 4.7% and that of the statistical tagger drops to around 3.7%. Combining the taggers by using the output of one tagger to help the other did not lead to any further improvement.
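The comparison described in the abstract (same test corpus, error rates broken down by known and unknown words) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `error_rates`, the toy sentence, and its tags are hypothetical and are not taken from the paper's corpus or tooling.

```python
from collections import Counter


def error_rates(gold, predicted, known_words):
    """Return error rates overall and split by known/unknown tokens.

    gold        -- list of (token, gold_tag) pairs from the test corpus
    predicted   -- list of tags produced by one tagger, same order as gold
    known_words -- set of tokens that occurred in the training corpus
    """
    counts = Counter()
    for (token, gold_tag), pred_tag in zip(gold, predicted):
        group = "known" if token in known_words else "unknown"
        for key in (group, "all"):
            counts[key, "total"] += 1
            if pred_tag != gold_tag:
                counts[key, "errors"] += 1
    # Only report groups that actually occur in the test data.
    return {
        key: counts[key, "errors"] / counts[key, "total"]
        for key in ("all", "known", "unknown")
        if counts[key, "total"]
    }


# Toy data (hypothetical): one unknown word, which the tagger gets wrong.
gold = [("Das", "ART"), ("Haus", "NN"), ("brennt", "VVFIN"), ("Zürich", "NE")]
tagger_output = ["ART", "NN", "VVFIN", "NN"]
known = {"Das", "Haus", "brennt"}

rates = error_rates(gold, tagger_output, known)
```

Running the same function on the output of a second tagger over the same `gold` list gives directly comparable figures, which is the kind of side-by-side comparison the abstract reports.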

Similar resources

High-Performance Tagging on Medical Texts

We ran both Brill’s rule-based tagger and TNT, a statistical tagger, with a default German newspaper-language model on a medical text corpus. Supplied with limited lexicon resources, TNT outperforms the Brill tagger with state-of-the-art performance figures (close to 97% accuracy). We then trained TNT on a large annotated medical text corpus, with a slightly extended tagset that captures certai...

Adding manual constraints and lexical look-up to a brill-tagger for German

We have trained the rule-based Brill-Tagger for German. In this paper we show how the tagging performance improves with increasing corpus size. Training over a corpus of only 28'500 words results in an error rate of around 5% for unseen text. In addition we demonstrate that the error rate can be reduced by looking up unknown words in an external lexicon, and by manually adding rules to the rule...

Comparing a Linguistic and a Stochastic Tagger

Concerning different approaches to automatic PoS tagging: EngCG-2, a constraint-based morphological tagger, is compared in a double-blind test with a state-of-the-art statistical tagger on a common disambiguation task using a common tag set. The experiments show that for the same amount of remaining ambiguity, the error rate of the statistical tagger is one order of magnitude greater than that o...

Tagging French - comparing a statistical and a constraint-based method

In this paper we compare two competing approaches to part-of-speech tagging, statistical and constraint-based disambiguation, using French as our test language. We imposed a time limit on our experiment: the amount of time spent on the design of our constraint system was about the same as the time we used to train and test the easy-to-implement statistical model. We describe the two systems and...

Comparing Earnings Management in Germany and the USA

This study presents empirical evidence concerning the effect of different accounting standards on earnings management. Prior studies have shown that accounting standards influence earnings management. A tighter accounting standards regime restricts management’s discretion to manipulate accruals and, at the same time, induces more costly real earnings management activities. To investigate this iss...

Journal title:

CoRR

Volume cs.CL/9811016

Pages -

Publication date 1998
